{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<h1> Creating a custom Word2Vec embedding on your data </h1>\n",
    "\n",
    "This notebook illustrates:\n",
    "<ol>\n",
    "<li> Creating a training dataset\n",
    "<li> Running word2vec\n",
    "<li> Examining the created embedding\n",
    "<li> Export the embedding into a file you can use in other models\n",
    "<li> Training the text classification model of [txtcls2.ipynb](txtcls2.ipynb) with this custom embedding.\n",
    "</ol>\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "# change these to try this notebook out\n",
    "BUCKET = 'alexhanna-dev-ml'\n",
    "PROJECT = 'alexhanna-dev'\n",
    "REGION = 'us-central1'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "os.environ['BUCKET'] = BUCKET\n",
    "os.environ['PROJECT'] = PROJECT\n",
    "os.environ['REGION'] = REGION"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Creating a training dataset\n",
    "\n",
    "The training dataset simply consists of a bunch of words separated by spaces extracted from your documents. The words are simply in the order that they appear in the documents and words from successive documents are simply appended together. In other words, there is not \"document separator\".\n",
    "<p>\n",
    "The only preprocessing that I do is to replace anything that is not a letter or hyphen by a space.\n",
    "<p>\n",
    "Recall that word2vec is unsupervised. There is no label."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "import google.datalab.bigquery as bq\n",
    "\n",
    "query=\"\"\"\n",
    "SELECT\n",
    "  CONCAT( LOWER(REGEXP_REPLACE(title, '[^a-zA-Z $-]', ' ')), \n",
    "  \" \", \n",
    "  LOWER(REGEXP_REPLACE(text, '[^a-zA-Z $-]', ' '))) AS text\n",
    "FROM\n",
    "  `bigquery-public-data.hacker_news.stories`\n",
    "WHERE\n",
    "  LENGTH(title) > 100\n",
    "  AND LENGTH(text) > 100\n",
    "\"\"\"\n",
    "\n",
    "df = bq.Query(query).execute().result().to_dataframe()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>text</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>reddit bookmarklets allow web site owners to c...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>why not let online ads fight it out in a geome...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>smashing the clock  bestbuy s  location and ho...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>ask hn  can google aggregate everything you ve...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>ask yc    think out loud  - like twitter justi...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                                text\n",
       "0  reddit bookmarklets allow web site owners to c...\n",
       "1  why not let online ads fight it out in a geome...\n",
       "2  smashing the clock  bestbuy s  location and ho...\n",
       "3  ask hn  can google aggregate everything you ve...\n",
       "4  ask yc    think out loud  - like twitter justi..."
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df[:5]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "with open('word2vec/words.txt', 'w') as ofp:\n",
    "  for txt in df['text']:\n",
    "    ofp.write(txt + \" \")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This is what the resulting file looks like:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "reddit bookmarklets allow web site owners to cheat to get mostly up votes  simple realistic example given   the idea is to associate a positive link and a negative link with your site  you would submit both to reddit  p based on the user s experience  you would switch him her to the positive negative link  p that way  happy users would vote up the positive link while unhappy users would vote down the negative link   your site now has a better chance of making the front page  p as an example  suppose your site has a game puzzle  p when the user visits the site via the positive or negative link  you redirect to the negative link  p if the user plays several levels of the game puzzle  then he she probably likes it and then you can switch him her to the positive link  why not let online ads fight it out in a geometric real-time game played by advertisers and consumers  the advertiser may display his her ad along with all the other ads currently on display    p larger ads have the disadvant\r\n"
     ]
    }
   ],
   "source": [
    "!cut -c-1000 word2vec/words.txt"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Running word2vec\n",
    "\n",
    "We can run the existing tutorial code as-is."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/usr/local/envs/py3env/lib/python3.5/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.\n",
      "  from ._conv import register_converters as _register_converters\n",
      "/usr/local/envs/py3env/lib/python3.5/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.\n",
      "  from ._conv import register_converters as _register_converters\n"
     ]
    }
   ],
   "source": [
    "%%bash\n",
    "cd word2vec\n",
    "TF_CFLAGS=( $(python -c 'import tensorflow as tf; print(\" \".join(tf.sysconfig.get_compile_flags()))') )\n",
    "TF_LFLAGS=( $(python -c 'import tensorflow as tf; print(\" \".join(tf.sysconfig.get_link_flags()))') )\n",
    "g++ -std=c++11 \\\n",
    "  -shared word2vec_ops.cc word2vec_kernels.cc \\\n",
    "  -o word2vec_ops.so -fPIC ${TF_CFLAGS[@]} ${TF_LFLAGS[@]} \\\n",
    "  -O2 -D_GLIBCXX_USE_CXX11_ABI=0\n",
    "\n",
    "#   -I/usr/local/lib/python2.7/dist-packages/tensorflow/include/external/nsync/public \\"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The actual evaluation dataset doesn't matter.  Let's just make sure to have some words in the input also in the eval. The analogy dataset is of the form \n",
    "<pre>\n",
    "Athens Greece Cairo Egypt\n",
    "Baghdad Iraq Beijing China\n",
    "</pre>\n",
    "i.e. four words per line where the model is supposed to predict the fourth given the first three. But we'll just make up a junk file."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Writing word2vec/junk.txt\n"
     ]
    }
   ],
   "source": [
    "%%writefile word2vec/junk.txt\n",
    ": analogy-questions-ignored\n",
    "the user plays several levels\n",
    "of the game puzzle\n",
    "vote down the negative"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Data file:  ./words.txt\n",
      "Vocab size:  889  + UNK\n",
      "Words per epoch:  2912\n",
      "Eval analogy file:  ./junk.txt\n",
      "Questions:  2\n",
      "Skipped:  1\n",
      "Epoch    1 Step      185: lr = 0.191 loss =  23.78 words/sec =      199\r\n",
      "Eval    0/2 accuracy =  0.0%\n",
      "Epoch    2 Step      545: lr = 0.181 loss =  11.52 words/sec =      402\r\n",
      "Eval    0/2 accuracy =  0.0%\n",
      "Epoch    3 Step      905: lr = 0.172 loss =   9.14 words/sec =      404\r\n",
      "Eval    0/2 accuracy =  0.0%\n",
      "Epoch    4 Step     1265: lr = 0.163 loss =   7.78 words/sec =      394\r\n",
      "Eval    0/2 accuracy =  0.0%\n",
      "Epoch    5 Step     1625: lr = 0.154 loss =   7.44 words/sec =      397\r\n",
      "Eval    0/2 accuracy =  0.0%\n",
      "Epoch    6 Step     1984: lr = 0.145 loss =   6.64 words/sec =      399\r\n",
      "Eval    0/2 accuracy =  0.0%\n",
      "Epoch    7 Step     2531: lr = 0.131 loss =   6.40 words/sec =      601\r\n",
      "Eval    0/2 accuracy =  0.0%\n",
      "Epoch    8 Step     2891: lr = 0.122 loss =   6.05 words/sec =      404\r\n",
      "Eval    0/2 accuracy =  0.0%\n",
      "Epoch    9 Step     3251: lr = 0.113 loss =   6.15 words/sec =      397\r\n",
      "Eval    0/2 accuracy =  0.0%\n",
      "Epoch   10 Step     3611: lr = 0.104 loss =   6.07 words/sec =      393\r\n",
      "Eval    0/2 accuracy =  0.0%\n",
      "Epoch   11 Step     4159: lr = 0.090 loss =   6.12 words/sec =      602\r\n",
      "Eval    0/2 accuracy =  0.0%\n",
      "Epoch   12 Step     4331: lr = 0.085 loss =   6.12 words/sec =      201\r\n",
      "Eval    0/2 accuracy =  0.0%\n",
      "Epoch   13 Step     4878: lr = 0.072 loss =   5.83 words/sec =      605\r\n",
      "Eval    0/2 accuracy =  0.0%\n",
      "Epoch   14 Step     5237: lr = 0.062 loss =   6.08 words/sec =      404\r\n",
      "Eval    0/2 accuracy =  0.0%\n",
      "Epoch   15 Step     5597: lr = 0.053 loss =   5.96 words/sec =      401\r\n",
      "Eval    0/2 accuracy =  0.0%\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "2018-09-13 15:57:12.157304: I tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA\n",
      "2018-09-13 15:57:12.167545: I word2vec_kernels.cc:200] Data file: ./words.txt contains 16439 bytes, 2912 words, 889 unique words, 889 unique frequent words.\n",
      "/usr/local/envs/py3env/lib/python3.5/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.\n",
      "  from ._conv import register_converters as _register_converters\n"
     ]
    }
   ],
   "source": [
    "%%bash\n",
    "cd word2vec\n",
    "rm -rf trained\n",
    "python word2vec.py \\\n",
    "   --train_data=./words.txt --eval_data=./junk.txt --save_path=./trained \\\n",
    "   --min_count=1 --embedding_size=10 --window_size=2"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Examine the created embedding\n",
    "\n",
    "Let's load up the embedding file in TensorBoard.  Start up TensorBoard, switch to the \"Projector\" tab and then click on the button to \"Load data\".  Load the vocab.txt that is in the output directory of the model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from google.datalab.ml import TensorBoard\n",
    "TensorBoard().start('word2vec/trained')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here, for example, is the word \"founders\" in context -- it's near doing, creative, difficult, and fight, which sounds about right ...  The numbers next to the words reflect the count -- we should try to get a large enough vocabulary that we can use --min_count=10 when training word2vec, but that would also take too long for a classroom situation. <img src=\"embeds.png\" />"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "for pid in TensorBoard.list()['pid']:\n",
    "    TensorBoard().stop(pid)\n",
    "    print('Stopped TensorBoard with pid {}'.format(pid))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Export the embedding vectors into a text file\n",
    "\n",
    "Let's export the embedding into a text file, so that we can use it the way we used the Glove embeddings in txtcls2.ipynb.\n",
    "\n",
    "Notice that we have written out our vocabulary and vectors into two files.  We just have to merge them now."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "   890   8900 226962 word2vec/trained/vectors.txt\r\n",
      "   890   1780  10929 word2vec/trained/vocab.txt\r\n",
      "  1780  10680 237891 total\r\n"
     ]
    }
   ],
   "source": [
    "!wc word2vec/trained/*.txt"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "==> word2vec/trained/vectors.txt <==\r\n",
      "2.540961503982543945e-01 5.147245526313781738e-01 -1.672663241624832153e-01 -4.278961420059204102e-01 -1.801081560552120209e-02 -1.780755966901779175e-01 -4.884876906871795654e-01 -1.350466255098581314e-02 8.268681168556213379e-02 -4.263160824775695801e-01\r\n",
      "-2.084412276744842529e-01 -1.018680110573768616e-01 -4.349195361137390137e-01 4.004665911197662354e-01 -3.781259655952453613e-01 -3.250748813152313232e-01 -2.427831590175628662e-01 2.993280589580535889e-01 4.350312054157257080e-01 -1.009981334209442139e-01\r\n",
      "-4.563158750534057617e-01 3.323142826557159424e-01 1.393919438123703003e-01 6.663140654563903809e-02 4.376976191997528076e-01 4.744379520416259766e-01 -1.887555420398712158e-01 -3.682589828968048096e-01 9.675115346908569336e-02 2.453538328409194946e-01\r\n",
      "\r\n",
      "==> word2vec/trained/vocab.txt <==\r\n",
      "b'UNK' 0\r\n",
      "b'to' 99\r\n",
      "b'the' 98\r\n"
     ]
    }
   ],
   "source": [
    "!head -3 word2vec/trained/*.txt"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "vocab = pd.read_csv(\"word2vec/trained/vocab.txt\", sep=\"\\s+\", header=None, names=('word', 'count'))\n",
    "vectors = pd.read_csv(\"word2vec/trained/vectors.txt\", sep=\"\\s+\", header=None)\n",
    "vectors = pd.concat([vocab, vectors], axis=1)\n",
    "del vectors['count']\n",
    "vectors.to_csv(\"word2vec/trained/embedding.txt.gz\", sep=\" \", header=False, index=False, index_label=False, compression='gzip')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "b'UNK' 0.2540961503982544 0.5147245526313781 -0.1672663241624832 -0.4278961420059204 -0.018010815605521202 -0.17807559669017792 -0.4884876906871796 -0.01350466255098581 0.08268681168556212 -0.4263160824775696\r\n",
      "b'to' -0.20844122767448423 -0.10186801105737686 -0.434919536113739 0.40046659111976624 -0.3781259655952454 -0.3250748813152313 -0.2427831590175629 0.29932805895805364 0.4350312054157257 -0.10099813342094419\r\n",
      "b'the' -0.4563158750534058 0.33231428265571594 0.1393919438123703 0.06663140654563904 0.43769761919975286 0.474437952041626 -0.1887555420398712 -0.3682589828968048 0.09675115346908568 0.2453538328409195\r\n",
      "\r\n",
      "gzip: stdout: Broken pipe\r\n"
     ]
    }
   ],
   "source": [
    "!zcat word2vec/trained/embedding.txt.gz | head -3"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Training model with custom embedding\n",
    "\n",
    "Now, you can use this embedding file instead of the Glove embedding used in [txtcls2.ipynb](txtcls2.ipynb)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Copying file://word2vec/trained/embedding.txt.gz [Content-Type=text/plain]...\n",
      "/ [0 files][    0.0 B/ 85.3 KiB]                                                \r",
      "/ [1 files][ 85.3 KiB/ 85.3 KiB]                                                \r\n",
      "Operation completed over 1 objects/85.3 KiB.                                     \n"
     ]
    }
   ],
   "source": [
    "%%bash\n",
    "gcloud storage cp word2vec/trained/embedding.txt.gz gs://${BUCKET}/txtcls2/custom_embedding.txt.gz"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "gs://alexhanna-dev-ml/txtcls2/trained_model us-central1 txtcls_180913_160510\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "CommandException: 1 files/objects could not be removed.\n",
      "CommandException: No URLs matched: txtcls/trainer/*.py\n",
      "ERROR: (gcloud.ml-engine.jobs.submit.training) Source directory [/content/datalab/training-data-analyst/courses/machine_learning/deepdive/09_sequence/txtcls] is not a valid directory.\n"
     ]
    }
   ],
   "source": [
    "%%bash\n",
    "OUTDIR=gs://${BUCKET}/txtcls2/trained_model\n",
    "JOBNAME=txtcls_$(date -u +%y%m%d_%H%M%S)\n",
    "echo $OUTDIR $REGION $JOBNAME\n",
    "gcloud storage rm --recursive --continue-on-error $OUTDIR\n",
    "gcloud storage cp txtcls1/trainer/*.py $OUTDIR\n",
    "gcloud ml-engine jobs submit training $JOBNAME \\\n",
    "   --region=$REGION \\\n",
    "   --module-name=trainer.task \\\n",
    "   --package-path=$(pwd)/txtcls1/trainer \\\n",
    "   --job-dir=$OUTDIR \\\n",
    "   --staging-bucket=gs://$BUCKET \\\n",
    "   --scale-tier=BASIC_GPU \\\n",
    "   --runtime-version=1.4 \\\n",
    "   -- \\\n",
    "   --bucket=${BUCKET} \\\n",
    "   --output_dir=${OUTDIR} \\\n",
    "   --glove_embedding=gs://${BUCKET}/txtcls2/custom_embedding.txt.gz \\\n",
    "   --train_steps=36000"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Copyright 2017 Google Inc. Licensed under the Apache License, Version 2.0 (the \"License\"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
